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Abstract 

This survey paper is written with the intention of giving a mathematical introduction to filtering 
techniques for intermittent data assimilation, and to survey some recent advances in the field. The 
paper is divided into three parts. The first part introduces Bayesian statistics and its application to 
statistical inference and estimation. Basic aspects of Markov processes, as they typically arise from 
scientific models in the form of stochastic differential and/or difference equations, are covered in the 
second part. The third and final part describes the filtering approach to estimation of model states 
by assimilation of observational data into scientific models. While most of the material is of survey 
type, very recent advances in the field of nonlinear data assimilation covered in this paper include 
a discussion of Bayesian inference in the context of optimal transportation and coupling of random 
variables, as well as a discussion of recent advances in ensemble transform filters. References and 
sources for further reading material will be listed at the end of each section. 



1 



1 Introduction to Bayesian statistics 



In this section, we summarize the Bayesian approach to statistical inference and estimation, in which 
probability is interpreted as a measure of uncertainty (of the system state, for example). Contrary to 
closely related inverse problem formulations, all variables involved are considered to be uncertain, and 
are described as random variables. Furthermore uncertainty is only discussed in the context of available 
information, requiring the computation of conditional probabilities; Bayes' formula is used for statistical 
inference. We start with a short introduction to random variables. 

1.1 Preliminaries 

We start with a sample space fl which characterizes all possible outcomes of an experiment. An event is 
a subset of ft and we assume that the sot J" of all events forms a a-algebra (i.e., J" is non-empty, and 
closed over complementation and countable unions). For example, suppose that il = R. Then events 
can be defined by taking all possible countable unions and complements of intervals (a, b] c M; these are 
known as the Borel sets. 

Definition (Probability measure). A probability measure is a function F : T ^ [0, 1] with the following 
properties: 

(i) Total probability equals one: P(0) = 1. 

(ii) Probability is additive for independent events: If Ai, A2, . . . , An, ... is a finite or countable collection 
of events Ai G and A^ n = for i j, then 

¥{UiAi) = j2nAi) 

i 

The triple {fl, T, P) is called a probability space. 

Definition (Random variable). A function X : O ^ R is called a (univariate) random variable if 

{uefl: X{u)) < a;} e J" 
for all X S M. The (cumulative) probability distribution function of X is given by 

Fx{x) = F{{uj e n : X{oj) < x}). 
The cumulative probability distribution function implies a probability measure on R which we denote by 

Often, when working with a random variable X, the underlying probability space (fi,J^, P) is not 
emphasised; one typically only specifies the target space X = 'R and the probability distribution or 
measure fix on X. We then say that /Ux is the law of X and write X ~ fix- A probability measure fix 
introduces an integral over X and 

= / f{x)f,x{dx) 
Jx 

is called the expectation value of a function / : R — > R (/ is called a measurable function where the 
integral exists). We also use the notation law(X) = fix to indicate that fix is the probability measure 
for a random variable X. Two important choices for / are f{x) = x, which leads to the mean x = Mx[x] 
of X, and f{x) = {x — x)^, which leads to the variance cr^ = Ex[(x — x)^] of X. 

Univariate random variables naturally extend to the multivariate case, i.e. X = , N > 1. A 
probability measure fix on X is called absolutely continuous (with respect to the standard Lebesgue 
integral dx on R^) if there exists a probability density function (PDF) -jtx ■ X ^ R with Trx{x) > 0, and 

= / fix) fix (dx) = f /(x)7rx(x)dx, 
Jx 
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for all measurable functions /. The shorthand /Ltx(dx) = wxdx is often adopted. The implication is that 

one can, for all practical purposes, work within the classical Ricmann integral framework and docs not 
need to resort to Lebesgue integration. Again we can define the mean x G M.^ of a multivariate random 
variable and its covariance matrix 

P = Ex[{x - x){x - xf] e M^^^. 
Here denotes the transpose of a vector a. We now discuss a few standard distributions. 

Example (Gaussian distribution). We use the notation X ^ N(m,(T^) to denote a univariate Gaussian 
random variable with mean x and variance , with PDF given by 

-iTx{x) = — ^e-i^(^-®)', 



x e M. In the multivariate case, we use the notation X ^ N(a;, S) to denote a Gaussian random variable 
with PDF given by 

^^(^) = (2^)^/2|s|i/2 (-^(x - x)^J:-\x - x)^ , 

x e K". 

Example (Laplace distribution and Gaussian mixtures). The univariate Laplace distribution has PDF 

X G M. This may be rewritten as 

Jo v27rcr 

which is a weighted Gaussian PDF with mean zero and variance a^, integrated over a. Replacing the 
integral by a Riemann sum over a sequence of quadrature points {<Jj}j^i, we obtain 

and the constant of proportionality is chosen such that the weights aj sum to one. This is an example 
of a Gaussian mixture distribution, namely a weighted sum of Gaussians. In this case, the Gaussians are 
all centred on a; = 0; the most general form of a Gaussian mixture is 



J 1 



with weights oij > subject to '}2'j^i oij = ^, and locations — oo < Xj < oo. Univariate Gaussian mixtures 
generalize to mixtures of multi-variate Gaussians in the obvioiis manner. 

Example (Point distribution). As a final example, we consider the point measure iJ.xo defined by 

/(x)/i^„(da;) = /(a;o). 

X 

Using the Dirac delta notation d{-) this can be formally written as /Ltxo(da;) = S{x — Xo)dx. The associated 
random variable X has the certain outcome X{oj) = xo for almost all ui G fl. One can call such a random 
variable deterministic, and write X — xq for short. Note that the point measure is not absolutely 
continuous with respect to the Lebesgue measure, i.e., there is no corresponding probability density 
function. 
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We now briefly discuss pairs of random variables Xi and X2 over the same target space X. Formally, 
we can treat them as a single random variable Z = {Xi,X2) over Z = X x X with a joint distribution 

MXiX2(a;i,a;2) = ^J-z{z)■ 

Definition (Marginals, independence, conditional probability distributions). Let Xi and X2 denote two 
random variables on X with joint PDF ttxiXsC^ci, 2:2). The two PDFs 



X 



Trxii^i) = I 7rj(:iX2(a;i,a;2)da;2 

and 



T^Xiixi) = / 7rxiX2(a;i,a;2)da;i, 
Jx 

respectively, are called the marginal PDFs, i.e. Xi wxi and X2 ^ 77x2- The two random variables are 
called independent if 

7^X1X2 2:2) = TTXi 77x2(2:2 )• 
We also introduce the conditional PDFs 

, I X 7rxiX2(a;i,a;2) 

and 

T^XAX2\X1) = , . ■ 

TTXi [Xl ) 

Example (Gaussian joint distributions). A Gaussian joint distribution ■KxY{x,y), a;,y e M, with mean 

(x, y) and covariance matrix 

E 



2 2 

^ XX ^ xy 

2 2 

^yx ^yy 



leads to a Gaussian conditional distribution 



7rx(a;|y) = ^^e-(--^=)V(2a,^), (1) 



with conditional mean 

^ — ^ J- / 

'xy^yy 



Xc = x + al u hy-y) 



and conditional variance 



2 _ 2 _ 2 ^-22 

c XX xy yy yx' 



For given y, we define X|y as the random variable with conditional probability distribution Trx{x\y), and 
write X\y ~ N(a;c, a'^). 



1.2 Bayesian inference 

We start this section by considering transformations of random variables. A typical scenario is the 
following one. Given a pair of independent random variables S with values in 3^ = and X with values 
in X = together with a continuous map h : — )• M^, we define a new random variable 

Y = h{X) + E. (2) 

The map h is called the observation operator, which yields observed quantities given a particular value x 

of the state variable X, and S represents measurement errors. 

Theorem (PDF for transformed random variable). Assume that both X and S are absolutely continuous, 
then Y is absolutely continuous with PDF 

'^Y{y)= I n^iy - h{x))'Kx{x)dx. (3) 
Jx 

If X is a deterministic variable, i.e. X = xq for an appropriate xo € M.^ , then the PDF simplifies to 

nriy) = nsiy - h{xo)) . 
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Proof. We start with X ~ xq- Then Y — h{xo) — ^ which immediately imphes the stated result. In the 
general case, consider the conditional probability 

■^viylxo) = 7rH(y - h{xo)). 

Equation ^ then follows from the implied joint distribution 

t^xy{x,v) tty{v\x)tix{x) 

and subsequent marginalization, i.e. 



T^viy) ^ I nxY{y,x)dx= / 7rYiy\x)TTx{x)dx. 
Jx Jx 



□ 



The problem of predicting the distribution Try of Y given a particular configuration of the state 
variable X = xq \s called the forward problem. The problem of predicting the distribution of the state 
variable X given an observation Y = yo gives rise to an inference problem, which is defined more formally 
as follows. 

Definition (Bayesian inference). Given a particular value yo € R^, we consider the associated condi- 
tional PDF TTx{x\yo) for the random variable X. From 

'^XY{x,y) = TrY{y\x)Trx{x) = 7rx{x\y)TrY{y) 

we obtain Bayes ' formula 

The object of Bayesian inference is to obtain TTxix\yo). 

Since 7ry(?/o) 7^ is a constant, Equation Q can be written as 

TTxixlyo) cx Trxiyo\x)'!rx{x) = 7rH(yo - h{x))Trxix), 

where the constant of proportionality depends only on j/o- We denote ttx^x) the prior PDF of the random 
variable X and 7rx(a;|yo) the posterior PDF. The function TT{yo\x) is called the likelihood function. 

Having obtained a posterior PDF TTx{x\yo), it is often necessary to provide an estimate of a "most 
likely" value of x conditioned on yo- Bayesian estimators for x are defined as follows. 

Definition (Bayesian estimators). Given a posterior PDF 7rx(a;|yo) we define a Bayesian estimator 
X e X hy 



arg min^,g;f J L{x' , x)nxix\yo)dx 



where Li[x',x) is an appropriate loss function. Popular choices include the maximum a posteriori (MAP) 
estimator with x corresponding to the modal value of nx {x\yo). The MAP estimator formally corresponds 
to the loss function L{x' , x) — l{x'^x}- The posterior median estimator corresponds to L(x', x) — 
while the minimum mean square error estimator (or conditional mean estimator) 



L 



xTTx{x\yo)dx 



results from Ij{x' , x) — 



\x' -xP. 



We now consider an important example for which the posterior can be computed analytically. 
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Example (Bayes' formula for Gaussian distributions). Consider the case of a scalar observation, i.e. K 
1, with 7V(0,cr2j. Then 



We also assume that X ~ N(a;, P) and that h{x) — Hx. Then the posterior distribution of X given 
y = 2/0 is also Gaussian with mean 

Xc = x- PH^{HPH^ + alX^{Hx - yo) 

and covariance matrix 

Pc = P- PH'^iHPH'^ + alX^HP. 

These are the famous Kalman update formulas which follow from the fact that the product of two 
Gaussian distributions is also Gaussian, where the variance oiY = HX + E is given by 

aly = HPH'^ + al^ 

and the vector of covariances between x € and y — Hx e M is given by PH^ . For Gaussian random 
variables, the MAP, posterior median, and minimum mean square error estimators coincide and are given 
by Xc- The case of vector- valued observations will be discussed in Section Finally note that Xc solves 
the minimization problem 



arg mm 



^ {x - xfp-\x -x) + ^{Hx - yof 



cGM" [2' ' ' ' 2R 

which can be viewed as a regularization of the ill-posed inverse problem 

yo = Hx, x£ R^, iV > 1, 

in the sense of Tikhonov. A standard Tikhonov regularization would be based on P~^ = 51 with the 
regularization parameter 5 > Q appropriately chosen. In the Bayesian approach to inverse problems the 
regularization term is instead determined by the Gaussian prior ttx • 

We mention in passing that Bayes' formula has to be replaced by the Radon-Nikodym derivative in 
the case where the prior distribution is not absolutely continuous with respect to the Lebegue measure 
(or in case the space X does not admit a Lebesgue measure). Consider as an example the case of an 
empirical measure /ix centered about the M samples Xi X , i — 1, . . . , M , i.e. a weighted sum of point 
measures given by 

1 ^' 

^J,x{dx) = — ^^^,(dx). 

Then the resulting posterior measure /ix('|yobs) is absolutely continuous with respect to i^x, i-e. there 
exists a Radon-Nikodym derivative such that 

f{x)^Ix{dx\ya) = / f{x)— -—^ix{dx) 



and the Radon-Nikodym derivative satisfies 

dl^xix\yo) 



oc TTE{h{x) - yo). 



dfixix) 

Furthermore, the explicit expression for the posterior measure is given by 



^J'xidx\yo) = w» ^a:.(dx), 

i=l 



with weights Wi > defined by 

(X TTs{h{xi) - yo), 

and the constant of proportionality is determined by the condition X]f=i '"'i = ^■ 
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1.3 Coupling of random variables 

We have seen that under Bayes' formula a prior probabihty measure fJ-xi') on X is tranformed into 
a posterior probability measure fix{-\yo) on X conditioned on the observation yo — Y{uj). With each 
of the probability measures, we can associate random variables such that, e.g., Xi ~ nx and X2 ^ 
fJ'xi'lyo)- However, while Bayes' formula leads to a transformation of measures, it does not imply a 
specific transformation on the level of the associated random variables; many different transformations 
of random variables lead to the same probability measure. In this section we will, therefore, introduce 
the concept of coupling two probability measures. 

Definition (Coupling). Let iixi and (1x2 denote two probability measures on a space X. A coupling of 
liXi and fix2 consists of a pair Z — {Xi,X2) of random variables such that Xi ^ l^-Xi, X2 ^ AiX2i and 
Z ^ fiz- The joint measure fiz on the product space Z = X x X, is called the transference plan for this 
coupling. The set of all transference plans is denoted by Il{fj,Xi, 1^X2)- 

Here, we will discuss different forms of couplings assuming that both the source and target distributions 
are explicitly known, whilst applications to Bayes formula ^ will be discussed in Sections 11.41 and [31 In 
practice, the source distribution needs often to be estimated from available realizations of the underlying 
random variable Xi. This is the subject of parametric and non-parametric statistics and will not be 
discussed in this survey paper. In the context of Bayesian statistics, knowlege of the source (prior) 
distribution and the likelihood implies knowledge of the target (posterior) distribution. 

Since prior distributions in Bayesian inference are generally assumed to be absolutely continuous, the 
discussion of couplings will be restricted to the less abstract case oi X — and (dx) = ttxi {x)Ax, 
/xjs:2(dx) = ■Kx2{x)(lx. In other words, we assume that the marginal measures are absolutely continuous. 
We will, in general, not assume that the coupling is absolutely continuous on Z = X x X = R^^. Clearly, 
couplings always exist since one can use the trivial product coupling 

T^z{xi,X2) = ■nxx{xi)iTx2{x2), 

in which case the associated random variables Xi and X2 are independent. The more interesting case is 
that of a deterministic coupling. 

Definition (Deterministic coupling). Assume that we have a random variable Xi with law and a 
second probability measure ^X2- A diffeomorphism T : X ^ X is called a transport map if the induced 
random variable X2 = T{Xi) satisfies 

f{x2)l^X2{<dx2) ^ / /(r(a;i))^xi(da;i) 
X Jx 

for all suitable functions f : X M.. The associated coupling 

^izidxi,dx2) = S{x2 - T{xi))nxi{<ixi)dx2, 

where 5{-) is the standard Dirac distribution, is called a deterministic coupling. Note that nz is not 
absolutely continuous even if both fixi and (1x2 are. 

Using 

f{x2)S{x2 ~ T{x,))dx2 = /(T(xi)), 



ix 

it indeed follows from the above definition of fiz that 



f{x2)fiX2{dx2) ^ fix2)fJ.zidxi,dx2)^ f{T{xi))nxi{dxi). 
X Jz Jx 

We discuss a simple example. 

Example (One-dimensional transport map). Let Trxi{x) > and ttx2{x) > denote two PDFs on 
X = R. We define the associated cumulative distribution functions by 

FxAx) = / nxi{x')dx', Fx2{x) ^ / tt^^ (a;')da;'. 
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Since is monotonically increasing, it has a unique inverse F^^{p) for p e [Oj !]■ The inverse may be 
used to define a transport map that transforms Xi into X2 as folfows, 

X2^T{X,)^F^l{FxAXi)). 

For example, consider the case where Xi is a random variable with uniform distribution U([0, 1]) and 
X2 is a random variable with standard normal distribution N(0, 1). Then the transport map between Xi 
and X2 is simply the inverse of the cumulative distribution function 

V Ztt J -00 

which provides a standard tool for converting uniformly distributed random numbers to normally dis- 
tributed ones. 

We now extend this transform method to random variables in with N = 2. 

Example (Knothe-Rosenblatt rearrangement). Let irxiix^ ,x'^) and TTx^ix^^xP') denote two PDFs on 
X — {x^,x'^) G M^. A transport map between i:xi and 'ITX2 can be constructed in the following manner. 
We first find the two one-dimensional marginals 7r^i(a:^) and T^x^i^^) of the two PDFs. In the previous 
example we have seen how to construct a transport map X2 = Ti{X\) which couples these two one- 
dimensional marginal PDFs. Here X} denotes the first component of the random variables Xi, i — 1,2. 
Next we write 

7rxi(a;\a;2) = tt^i (a;^|a;^)7rjfi (a;^), irx^ix'^ ,x'^) = 7rx2(a;^|a;^)7r^i(a::^) 

and find a transport map X| = T2{Xl, Xi) by considering one-dimensional couplings between ttx^ {x'^\x^) 
and TTx2{x'^\T{x^)) with x^ fixed. The associated joint distribution is given by 

Trz{xl,xl,xl,X2) = d{xl - Ti(xl))S{xl ~ T2{xl,xl))TrxA^i^Xi)- 

This is called the Knothe-Rosenblatt rearrangement, also well-known to statisticians under the name 
of conditional quantile transforms. It can be extended to M^, > 3 in the obvious way by introducing 
the conditional PDFs 

TTXi[x'^\x^,X^), TTX2[X'^\X^,X^), 

and by constructing an appropriate map Af = TAXl,Xl,X'i) from those conditional PDFs for fixed 
pairs {x\,x\) and {x\,x'^) = {Ti{x\),T2{x\,x\)) etc. While the Knothe-Rosenblatt rearrangement can 
be used in quite general situations, it has the undesirable property that the map depends on the choice 
of ordering of the variables i.e., in two dimensions a different map is obtained if one instead first couples 
the x'^ components. 

Example (Affine transport maps for Gaussian distributions). Consider two Gaussian distributions 
N(a;i,I]i) and '^i{x2,^2) in with means xi and X2 and covariance matrices Si and S2, respectively. 
We first define the square root Y}/'^ of a symmetric positive definite matrix E as the unique symmetric 
positiv definite matrix which satisfies S^/^E^/^ = E. Then the affine transformation 

X2 = T{xi) = X2 + Yy'^T.i^^'^{xi- Xi) (5) 
provides a deterministic coupling. Indeed, we find that 

[X2 - X2)^Y'2^{X2 - X2) = {Xi - Xi)'^Ej'^(xi - Xi) 

under the suggested coupling. The proposed coupling is, of course, not unique since 

X2 = T{xi) = X2 + E2^^QEj^^^^(a;i - xi), 

where Q is an orthogonal matrix, also provides a coupling. We will see in Section 13.31 that a coupling 
between Gaussian random variables is also at the heart of the ensemble square root filter formulations of 
sequential data assimilation. 
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Deterministic couplings can be viewed as a special case of a Markov process {Xn}n£{i,2} defined by 

7rx2(a;2)= / 7r(a;2|a;i)7rxi(a;i)da;i, 

where tt(x2\xi) denotes an appropriate conditional PDF for the random variable X2 given Xi — x\. 
Indeed, we simply have 

7r(a;2|2:i) = 8{x2 - T{xx)) 

for deterministic couplings. We will come back to Markov processes in Section (5] 

The trivial coupling ttz{xi, X2) ~ ti'Xi (2;i)7rx2 (2^2) leads to a zero correlation between the induced 
random variables Xi and X2 since their covariance is 

cov(Xi,X2) = E2[(a;i - Xi){x2 - X2)'^] = Ez[xiX2] - xixj = 0, 

where Xi = Ex^ [x] . A transport map leads instead to the covariance matrix 

cov{Xi,X2)^Ez[xix'^]^ExAxi]iKxAx2]f ^^xAxiTixif]-xix'^, 

which is non-zero in general. If several transport maps exists then one could choose the one that maxi- 
mizes the covariance. Consider, for example, univariate random variables Xi and X2, then maximising 
their covariance for given marginal PDFs has an important geometric interpretation: it is equivalent to 
minimizing the mean square distance between xi and 2^(2:1) ~ X2 given by 

Ez[\x2 - = ExA\xi\^] + ExA\x2\^] - '2Kz[xiX2] 

= Ex, [\xi\^] + ExA\x2\^] - 2Ez[{xi - xi){x2 - X2)] - 2xiX2 
= ExJ|.Ti|2] +Ex2[N2p] - 2x1X2 - 2cov(Xi,X2). 

Hence finding a joint measure ^z that minimizes the expectation of (xi — X2)^ simultaneously maxi- 
mizes the covariance between Xi and X2- This geometric interpretation leads to the celebrated Monge- 
Kantorovitch problem. 

Definition (Monge-Kantorovitch problem). A transference plan n*^ e Il{pxi, 1^X2) is called the solution 
to the Monge-Kantorovitch problem with cost function c(xi,X2) = ||xi — X2IP if 

^2 = arg inf Ez[||xi - X2IP]. (6) 

The associated function 

W^(mxi,M^2) ^Ez[\\xi -X2IP], law(Z) 
is called the L^-Wasserstein distance of ^Xi and fix2 ■ 

Theorem (Optimal transference plan). // the measures fJ.Xi, i = 1,2, are absolutely continuous, then 
the optimal transference plan that solves the Monge-Kantorovitch problem corresponds to a deterministic 
coupling with transfer map 

A2 =T(Xi) = V,V(^i), 

for some convex potential -0 : — > M. 

Proof. We only demonstrate that the solution to the Monge-Kantorovitch problem is of the desired form 
when the infimum in ([6]) is restricted to deterministic couplings. See |49j for a complete proof and also 
for more general results in terms of subgradients and weaker conditions on the two marginal measures. 

We denote the associated PDFs by nxi, i = 1,2. We also introduce the inverse transfer map Xi — 
S{X2) = T~^(A"2) and consider the functional 

^S,^^l [ ||5(x)-x||Vx2(x)dx + 

/ [^iS{x))7TX2 (x) - *(x)7rxi (x)] dx 
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in S and a potential ^' : ^ R. We note that 

[7:xAT{x))\DT{x)\ - 7rx,(x)] dx 



by a simple change of variables. Here \DT{x)\ denotes the determinant of the Jacobian matrix of T 
at X and the potential ^' can be interpreted as a Lagrange multiplier enforcing the coupling of the two 
marginal PDFs under the desired transport map. 

Taking variational derivatives with respect to S and we obtain two equations 

^ = 7:xAx) KS{x) -x)+ V..^{Six))] = 

and 

^^^nxAx)+nxAT{x))\DT{x)\=0 (7) 
characterizing critical points of the functional C. The first equality implies 

X2 = xi+ V^*(xi) = [\xjxi + ^(xi)^ W^i^ixi) 

and the second recovers our Ansatz that T transforms ttxi into Tr^a as a result of the Lagrange multiplier 

Example (Optimal transport maps for Gaussian distributions). Consider two Gaussian distributions 
N(5i,Ei) and N(x2,S2) in K.^ with means xi and X2 and covariance matrices Si and S2, respectively. 
We had previously discussed the deterministic coupling ([5]). However, the induced afRne transformation 
X2 — T[xi) cannot not be generated from a potential tp since the matrix Sj^^Sj^ is not symmetric. 
Indeed the optimal coupling in the sense of Monge-Kantorovitch with cost function c{xi,X2) — \\xi—X2\\'^ 
is provided by 



X2 = T{xi) := X2 + "^^1^^ Sj'^SiSg'^ Ej'"" (xi - xi) . (8) 



^1/2 



'1/2 



1/2, 



See [40] for a derivation. The following generalization will be used in Section 13.31 Assume that a matrix 
A G K-'^x^f is given such that E2 = AA"^ . Clearly we can chose A = ^2''^ in which case M — N and A is 
symmetric. However we allow for A to be non-symmetric and M can be different from N. An important 
observation is that one can replace £2^^ in ([5]) by ^ and A^ , respectively, i.e. 

T{xi) = X2 + A [A^Si^] A^ {xi - xi) . (9) 

While optimal couplings are of broad theoretical and practical interest, their computational imple- 
mentation can be very demanding. In Section [31 we will discuss an embedding method originally due to 
Jiirgen Moser |37) , which leads to a generally non-optimal but computationally more tractable formulation 
in the context of Bayesian statistics and data assimilation. 



1.4 Monte Carlo methods 

Monte Carlo methods, also called particle or ensemble methods depending on the context in which they 
are being used, can be used to approximate statistics, namely expectation values Ex[/], for a random 
variable X. We begin by discussing the special case f{x) = x, namely, the mean. 

Definition (Empirical mean). Given a sequence Xi, i = 1, . . . , M, of independent random variables with 
identical measure /xx, the empirical mean is 

^ M ^ M 

i=l i=l 

with samples Xi = Xi{uj). 
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Of course, xm itself is the realization of a random variable Xm and we consider the mean squared 
error (MSE) 

MSE(x) =E^^J(5M-i)'] 

= (E^,, [xm] - xf + E;f^^ [{xm - ^x,, [xm]?] (10) 

with respect to the exact mean value x = Ex [a;]. We have broken down the MSE into two components: 
squared bias and variance. Such a decomposition is possible for any estimator and is known as the 
bias-variance decomposition. The particular estimator Xm is called unbiased since Ej^^^ [xm] = x for any 
M > 1. Furthermore Xm converges weakly to x under the central limit theorem provided has finite 
second-order moments, i.e. 

hm E^^,^ [(xM-^x,,[xM]f] =0. 

M — >oo 

It remains to generate samples Xi = Xi (w) from the required distribution. Methods to do this include 
the von Neumann rejection method and Markov chain Monte Carlo methods, which we will briefly discuss 
in Section [2j Often the prior distribution is assumed to be Gaussian, in which case explicit random 
number generators are available. We now turn to the situation where samples from the prior distribution 
are available, and are to be used to approximate the mean of the posterior distribution (or any other 
expectation value). 

Importance sampling is a classical method to approximate expectation values of a random variable 
X* ~ TTjft using samples from a random variable X^ ^ ttxp, which requires that the target PDF ttx* 
is absolutely continuous with respect to proposal PDF ttxp ■ This is the case for the prior and posterior 
PDFs from Bayes' formula i.e. we set the proposal distribution ttxp{x) equal to the prior distribution 
TTxix) and the posterior distribution 7rx(a;|?/o) oc 7ry(j/o|a;)7rx(a;) becomes the target distribution irxtix). 

Definition (Importance sampling for Bayesian estimation). Let xf"™, i — 1,...,M, denote samples 
from the prior PDF nxix), then the importance sampler estimate of the mean of the posterior 7rx{x[yo) 
is 

M 

-post _ prior /-, -, n 

■''M ~ 2-^ '^%Xi \H) 

i=\ 



with importance weights 

„„ _ 7ry(yo|a;f'°'') 



EA'I / I prior \ 

^=l^^Y[yo[x^ ) 



(12) 



Importance sampling becomes statistically inefficient when the weights have largely varying magni- 
tude, which becomes particularly significant for high-dimensional problems. To demonstrate this effect 
consider a uniform prior on the unit hypercube 1^ = [0, 1]^. Each of the M samples Xi from this prior 
formally represent a hypercube with volume 1/M. However, the likelihood measures the distance of a 
sample Xi to the observation ijq in the Euclidean distance and the volume of a hypersphere decreases 
rapidly relative to that of an associated hypercube as N increases. Within the framework of the bias- 
variance decomposition of a mean squared error such as (jlOp . the curse of dimensionality manifests itself 
in large variances for finite M. 

To counteract this curse of dimensionality, one may utilize the concept of coupling. In other words, 
assume that we have a transport map a;^"*'* = T{xP"°'-') which couples the prior and posterior distributions. 
Then, with transformed samples x^°^^ — T(a;f'^'°'^), i — 1, . . . , Af, we obtain the estimator 

M 

-post V ^ ^ post 

^M — 2^ WiX^ 

1=1 

with equal weights Wi = 1/M. 

Sometimes one cannot couple the prior and posterior distribution directly, or the coupling is too 
expensive computationally. Then one can attempt to find a coupling between the prior PDF T^xix) 
and an approximation 7fx(x|yo) to the posterior PDF 7rx(x|?/o) c< T:Y{yo\x)'nx{x). Given an associated 
transport map X?™? = f'(Js:P"°''), i.e. 

nx{f{x)\yo)^Tix{x)[Df{x)\-\ 



11 



one then takes TTx{x\yo) as the proposal density nxp^x) in an importance sampler with realizations xf^°^, 
i = 1, . . . , M, defined by 



x: 



prop rp/ prior 



T{xm- 



An asymptotically unbiased estimator for the posterior mean is now provided by 

M 



X 



post 
M 



E^^^r^ (13) 



i=l 

with weights 



T^Y( yo\xf°^)7Tx{xf'°'P) , , I ^prop M ^ ^prior ^ | ^TX (xf °P ) 



i = 1, . . . The constant of proportionality is chosen such that X]f=i ~ 1- Indeed, if 7rxp(a;) = 
Ti'x{x\yo) = T^xixlyo), we recover the case of equal weights Wi = 1/M, and 7rxp(a;) = Ti'x{x\yo) — t^x{x 
leads to standard importance sampling using prior samples, i.e. x^""^ = xY^°^ . 

We will return to the subject of sampling from the posterior distribution in Sections 12.31 and 13.21 



References 

An excellent introduction to many topics covered in this survey is [22j . Bayesian inference and a Bayesian 
perspective on inverse problems are discussed in [24] , [38] , [31] . The monographs [49l [5^ provide an in 
depth introduction to optimal transportation and coupling of random variables. Monte Carlo methods 
are covered in 02]. We also point to [50] for a discussion of estimation and regression methods from a 
bias-variance perspective. A discussion of infinite-dimensional Bayesian inference problems can be found 
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2 Elementary stochastic processes 

In this section, we collect basic results concerning stochastic processes which are of relevance for the data 
assimilation problem. 

Definition (Stochastic process). Let T be a set of indices. A stochastic process is a family {Xt]t^T of 
random variables on a common space A", i.e. Xt{ui) € X . 

In the context of dynamical systems, the variable t corresponds to time. We distinguish between 
continuous time t e [O,tond] C M or discrete time i„ = nAi, n G {0,1,2,...} = T, with Ai > a 
time-increment. In cases where subscript indices can be confusing we will also use the notations X(t) 
and X{tn), respectively. 

A stochastic process can be seen as a function of two arguments: t and w. For fixed uj, Xt{uj) becomes 
a function of t G T, which we call a realization or trajectory of the stochastic process. We will restrict to 
the case where Xt{Lj) is continuous in t (with probability 1) in the case of a continuous time. Alternatively, 
one can fix the time t € T and consider the random variable Xt{-) and its distribution. More generally, one 
can consider Z-tuples {ti,t2, . . . ,ti) and associated Z-tuples of random variables {Xt-^ (•), Xt^i-), . . . , Xt, (•)) 
and their joint distributions. This leads to concepts such as temporal correlation. 

2.1 Discrete time Markov processes 

First, we develop the concept of Markov processes for discrete time processes. 

Definition (Discrete time Markov processes). The discrete time stochastic process {Xn}neT with X = 
and T — {0, 1,2,.. .) is called a (time-independent) Markov process if its joint PDFs can be written 

as 

TTnixQ^xi, ...,x„)= 7r(x„|a;„_i)7r(a;„_i|a;„_2) • • • 7r(xi |a:o)7ro(xo) 
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for all n € {0,1,2,...} — T. The associated marginal distributions 7r„ = ttjc^ satisfy the Chapman- 
Kolmogorov equation 

„.«M=/ M-'kKWd. (15) 



and the process can be recursively repeated to yield a family of marginal distributions {7r„}„gT for given 
ttq. This family can also be characterized by the linear Frohenius- Perron operator 

7r„+i = V^n, (16) 

which is induced by (ITSl) . 



The above definition is equivalent to the more traditional definition that a process is Markov if the 
conditional distributions satisfy 

7r„(a:„|a;o,a;i, . . . ,a;„_i) = 7r(a;„|x„_i). 

Note that, contrary to Bayes' formula (jH), which directly yields marginal distributions, the Chapman- 
Kolmogorov equation (|15p starts from a given coupling 

7rx„+iX„(a;„+i,x„) 7r(a;„+i |a;„)7rx„ (a;„) 

followed by marginalization to derive ■nxn^-i{xn+i)- A Markov process is called time-dependent if the 
conditional PDF 7r(x'|a;) depends on While we have considered time-independent processes in this 
section, we will see in Section[3]that the idea of coupling applied to Bayes' formula leads to time-dependent 
Markov processes. 

2.2 Stochastic difference and differential equations 

We start from the stochastic difference equation 

+ At, (17) 

where At > is a small parameter (the step-size), / is a given (Lipschitz continuous) function, and 
Zn ^ N(0, Q) are independent and identically distributed random variables with correlation matrix Q. 

The time evolution of the associated marginal densities 7rx„ is governed by the Chapman-Kolmogorov 
equation with conditional PDF 

7r(x'|a;) 



(47rAt)^/2|Q|i/2 

exp (-^(a;' -X- Atf{x)fQ-\x' -x- Atf{x))^ . (18) 

Proposition (Stochastic differential and Fokker-Planck equation). Taking the limit At 0, one obtains 
the stochastic differential equation (SDE) 

AXt = f{Xt)dt + V2Q^/^dWt (19) 

for Xt, where {Wtjoo denotes standard N -dimensional Brownian motion, and the Fokker-Planck equa- 
tion 

^ = -V, • (nxf) + V, • (QV.vrx) (20) 

for the marginal density nx{x, t). Note that Q = (no noise) leads to the Liouville, transport or continuity 
equation 

^ = -V. • (nxf), (21) 
which implies that we may interpret f as a given velocity field in the sense of fluid mechanics. 
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Proof. The difference equation (jl7p is called the Euler-Maruyama method for approximating the SDE 
(fT9)) . See [211 [26] for a discussion on the convergence of (HZ]) to (HH) at At 0. 

The Fokker-Planck equation (|20p is the linear combination of a drift and a diffusion term. To simplify 
the discussion we derive both terms separately from ((TT]) by first considering f ~ 0, Q ^ and then 
Q = 0, / 7^ 0. To simplify the derivation of the diffusion term even further we also assume a: G M and 
(5 = 1. In other words, we show that scalar Brownian motion 

dXt = V2dWt 

leads to the heat equation 

dnx d'^TTx 
dt dx'^ 

We first note that the conditional PDF (fTSl) reduces to 

{x' - xf 

under f{x) = 0, Q = 1, = 1, and the Chapman-Kolmogorov equation ([T5]) becomes 

^„+i(.t')= / ^=i=e-^'/(4^*)^„(x' + y)dy (22) 
under the variable substitution y — x — x' . We now expand 7r„(a;' + y) in ?/ about y = 0, i.e. 

nnix' + y) = Mx') + y^ix') + ^^(^') + • • • , 
and substitute the expansion into 



tt{x'\x) = (47rAi)"^/^exp ^- 



Jm V47rAt 

Jr V47rAi ox 



^/4^ 2 9a;2 

The integrals correspond to the zeroth, first and second-order moments of the Gaussian distribution with 
mean zero and variance 2 At. Hence 

TTn+lix') = 7r„(x') + At-^ix') + ■■■ 

and it can also easily be shown that the neglected higher-order terms contribute with 0{At'^) terms. 
Therefore 

At ax^ 

and the heat equation is obtained upon taking the limit At — > 0. The non- vanishing drift case, i.e. f{x) ^ 
0, while being more technical, can be treated in the same manner. 

One can also use ([7]) to derive Liouville's equation (|2ip directly. We set 

T{x) = a; + Atjix) 

and note that 

\DT(x)\ = 1 + AiV^ • / + 0{At^). 

Hence ([7]) implies 

T^x, = TTx. + Ai^x. V, • / + At(V,7r,J • / + ©(At^) 

and 

^^^^^^^ = -V.-(^x./) + 0(Ai). 
Taking the Umit At -> 0, we obtain dH]). □ 
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Following the work of Felix Otto (see, e.g., [lU |49]), we note that in the case of pure diffusion, i.e. 
/ = 0, the Fokker-Planck equation can be rewritten as a gradient flow system. We first introduce some 
notation. 

Definition (differential geometric structure on manifold of probability densities). We formally introduce 
the manifold of all PDFs on X = M.^ 

{tt : ^ R : 7r(a;) > 0, / 7r(x)da; = 1} 

with tangent space 

T^M = {0 : R^ ^ R : / 4,{x)dx = 0}. 
The variational derivative of a functional : 7W -> R is defined as 

^0d.^lim^(- + -^)-^(-). 

where is a function such that Jj^jv 0dx = 0, i.e. G T^Al. 
Consider the potential 

V{-nx)^ / TTxlnTTxdx, (23) 



which has functional derivative 

8V 



since 



In TTx, 



V{ttx + e<p) = V{Trx) + e / {(j)linrx + (f>)dx + O(e^) 
= V{ttx) + e / ^luTTx dx + O(e^), 

JR« 

Hence, we find that the diffusion part of the Fokker-Planck equation is equivalent to 

dnx ( SV 1 

• (QV^TTx) = • <^ TTxQ^.^ } ■ (24) 



This formulation allows us to treat diffusion in form of a vector field 

5V 

v{x,t) = -QV, 



Sttx 

which, contrary to vector fields arising from the theory of ordinary differential equations, depends on the 
PDF TTx. See the following Section [^75] for an application. 

Proposition (Gradient on the manifold of probability densities). Let be a metric tensor defined on 
T^M as 



ff^(</'l>2) = / (V:,Vl) • (MV^V2) TTdx 

with potentials ipi, i — 1,2, determined by the elliptic partial differential equation (PDE) 

-V^ • (ttMV^i/;,) = 0,, 

where M G R^^^ is a symmetric, positive- definite matrix. 
Then the gradient of a potential F{7t) under i?^ satisfies 

grad,i^(7r) = -V, • (nMV,^\ . (25) 
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Proof. Given the metric tensor g^^, the gradient is defined by 



5F 

5.(grad,F(7r),0) = / —^dx (26) 



for all (f> e Tj^Ad. Since any element e Tj^M can be written in the form 

= -V^ • (TrMV^T/-) 
with suitable potential ip, a potential ip exists such that 

grad,F(7r) = -V, • (^MV,^) G T^M 
and we need to demonstrate that 



OTT 



is consistent with (1261). Indeed, we find that 



5F r 5F 

—(fidx = - / — • (7rMV^7/;)da; 

OTT . BN OTT 



r SF 

/ ttV^— • (MV:,V')dx 

/ (V:,^') • (MV:,V)7rda; 
5^(gradi^(7r),0). 



□ 



It follows that the diffusion part of the Fokker-Planck equation can be viewed as a gradient flow on 
the manifold M. More precisely, set F{7t) — V{ttx) and M = Q to reformulate (|24l) as a gradient flow 

^ = -grad^x^(^x) 

with potential ([25)) . We will find in Section [3] that related geometric structures arise from Bayes' formula 
in the context of filtering. We finally note that 

dV f 5V dirx , 
-dx 



dt Jjjjv 5ttx dt 



MVj;-j ) TTxdx < 0. 



2.3 Ensemble prediction and sampling methods 

In this section, we extend the Monte Carlo method from Section [1.41 to the approximation of the marginal 
PDFs nx{x,t), t > 0, evolving under the SDE model (fT9|) . Assume that we have a set of independent 
samples Xi{0), i = 1, . . . , M, from the initial PDF ttx{x, 0). 

Definition (ensemble prediction). A Monte Carlo approximation to the time-evolved marginal PDFs 
TTx{x,t) can be obtained from solving the SDEs 

dx, = f{x^)dt + ^Q'l^dW, (t) (27) 

for 1 = 1,..., M, where {Wi{t)}f£-^ denote realizations of independent standard A^-dimensional Brownian 
motion and the initial conditions {xi{0)}fLi are realizations of the initial PDF ttx{x,0). This approxi- 
mation provides an example for a particle or ensemble prediction method and it can be shown that the 
estimator 



1 

1=1 

provides a consistent and unbiased approximation to the mean E^j [a;] . 
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Alternatively, using formulation ((24|) of the Fokker-Planck equation (|20p in the pure diffusion case, 
we may reformulate the random part in (j27p and introduce particle equations 



dxi , , W . 

-TT = iK^i) - Q^x-^ \X^) 

at oTix 

= !{xi) 7^0V,^x(a;.,i), (29) 

i = 1,...,M. Contrary to the SDE 1^7^, this formulation requires the PDF TTx{x,t), which is not 
explicitly available in general. However, a Gaussian approximation can be obtained from the available 
ensemble Xi{t), i = 1, . . . , M, using 

with empirical mean (j28p and empirical covariance matrix 

1 

P = j^—j^ixi ~ XM){xi - xm)'^ ■ (30) 
1=1 

Substituting this Gaussian approximation into (j29p yields the ensemble evolution equations 

fix.)+QP-\x,-XM), (31) 

which becomes exact in case the vector field / is linear, i.e. f{x) = Ax + u, the initial PDF nx{x, 0) is 
Gaussian and for ensemble sizes M ^ oo. 

We finally discuss the application of a particular type of SDEs (|T9l) as a way of generating samples 
Xi from a given PDF such as the posterior Trx{x\yo) of Bayesian inference. To do this, consider the SDE 
with the vector field / being generated by a potential U{x), i.e. f{x) ~ ~V xU{x), and Q = I. Then 
it can easily be verified that the PDF 

(x) ^ Z'^exp{-U{x)), Z=( exY>{-U{x))dx, 



"X 

is stationary under the associated Fokker-Planck equation (|20|) . Indeed 

V:, • (tT^VxC/) + • VxTT*x = V:, • (tT^V^^C/ + V^^TT^) = 0. 

Furthermore, it can be shown that tt^ is the unique stationary PDF and that any initial PDF nx {t = 0) 
approaches tt^ at exponential rate under appropriate assumption on the potential V. Hence Xt ^ tt^ 
for t — >■ cx). This allows us to use an ensemble of solutions Xi{t) of (P7|) with an arbitrary initial PDF 
Trxix.,0) as a method for generating ensembles from the prior or posterior Bayesian PDFs provided 
U{x) = — ln7rx(a;) or U{x) = —\mTx{x\yo), respectively. Note that the temporal dynamics of the 
associated SDE (|19l) is not of any physical significance in this context instead the SDE formulation is 
only taken as a device for generating the desired samples. If the SDE formulation is replaced by the 
Euler-Maruyama method (jl7l) . time-stepping errors lead to sampling errors which can be corrected for 
by combining ([TT)) with a Metropolis accept-reject criterion. The Metropolis adjusted method gives rise 
to particular instances of Markov chain Monte Carlo (MCMC) methods such as the Metropolis adjusted 
Langevin algorithm (MALA) or the hybrid Monte Carlo (HMC) method. The basic idea of MALA (as 
well as HMC) is to rewrite (H?]) with f{x) = -Va;C/(a;), Q = / as 

Pn+l/2 =Pn- ^V2At\/xU{Xn), (32) 



2 

Xn+l = Xn + V2Atpn+i/2, (33) 
Pn+1 ^Pn+1/2 - l:V2At\/xU{pn) (34) 
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having introduced a dummy momentum variable p with p„ being a realization of the random variable 
Zn ~ N(0,/). Under the Metropolis accept-reject criterion Xn+i is accepted with probability 

min{l,exp(- (£'„+! - En))}, 

where ^ ^ 

En = -^plPn + U [Xn), E^+l = -jPn+lPn+l + U {Xn+l) 

are the initial and final energies. Upon rejection one continues with Xn- The momentum value Pn+i is 
discarded after a completed time-step (regardless of its acceptance or rejection) and a new momentum 
value is drawn from N(0, /). It should however be noted that |£'n+i — En\ — >■ as the step-size Ai goes to 
zero and in practice the application of the Metropolis accept-rejection step is often not necessary unless 
At is chosen too large. The HMC method differs from MALA in that several iterations of (1321134^ are 
applied before the Metropolis accept-reject criterion is being applied. 

References 

A gentle introduction to stochastic processes can be found in [T7] and [TO]. A more mathematical 
treatment can be found in [51 and numerical issues are discussed in [3T1 [55]. See [JTJ US] for a 
discussion of the gradient flow structure of the Fokker-Planck equation. The ergodic behavior of Markov 
chains is covered in [23]. Markov chain Monte Carlo methods and the hybrid Monte Carlo method in 
particular are treated in [32]. See also 04] for the Metropolis adjusted Langevin algorithm (MALA). 

3 Recent advances in data assimilation and filtering 

In this section, we combine Bayesian inference and stochastic processes to tackle the problem of assimi- 
lating observational data into scientific models. 

3.1 Preliminaries 

We select a model written as a time-discretized SDE, such as (|17p . with the initial random variable 
satisfying Xq ~ ttq. In addition to the pure prediction problem of computing 7r„, n > 1, for given ttq, 
we assume that model states x G X = M.^ are partially observed at equally spaced instances in time. 
These observations are to be assimilated into the model. More generally, intermittent data assimilation 
is concerned with fixed observation intervals A^obs > and model time-steps At such that Atobs — LAt, 
L > 1, which allows one to take the limit L — >■ oo, At = Atohs/L. For simplicity, we will restrict the 
discussion to the case where observations yo{tn) — Yn{uj) G M.^ are made at every time step t„ = nAt, 
n > 1 and the limit At — ^ is not considered here. We will further assume that the observed random 
variables Yn satisfy the model i.e. 

Yn — h{Xn) + 

and the measurement errors S„ ^ N(0, R) arc mutually independent with common error covariance 
matrix R. We introduce the notation = {yo{U)}i=i....,k to denote all observations up to and including 
time tfc. 

Definition (Data assimilation). Data assimilation is the estimation of marginal PDFs 7r„(a;|Yfc) of the 
random variable Xn ~ X{tn) conditioned on the set of observations Y^. We distinguish three cases: (i) 
filtering k — n, (ii) smoothing k > n, and (iii) prediction k < n. 

The subsequent discussions are restricted to the filtering problem. We have already seen that evolution 
of the marginal distributions under (jl7p alone is governed by the Chapman-Kolmogorov equation (llSp 
with transition probability density (|T5| . We denote the associated Frobenius-Perron operator ([TO]) by 
Vai- Given Xq ^ ttq, we first obtain 
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This time propagated PDF is used as the prior PDF ttx — tti in Bayes' formula ^ at t — ti with 
yo — yo{ti) ^-nd likehhood 

Bayes' formula implies the posterior PDF 

7ri(a;|Yi) cx 7ry(yo(ii)|a;)7ri(a;), 
where the constant of proportionality depends on ya{ti) only. 
Proposition (Sequential filtering). The filtering problem leads to the recursion 



7r„+i(-|Y„) = 7'At7r„(-|Y„), 
7r„+i(x|Y„+i) cx 7ry(yo(in+i)|a;) 7r„+i(a;|Y„), 



(35) 



n > 0, and X„ ^ 7r„(-|Y„) solves the filtering problem at time t„. The constant of proportionality depends 
on yo{tn+i) only. 

Proof. The recursion follows by induction. □ 

Recall that the Frobenius-Perron operator Vai is generated by the stochastic difFference equation 
(fT7|) . On the other hand, Bayes' formula only leads to a transition from the predicted 7r„+i(a;|Y„) to 
the filtered 7r„_|_i(a;|Y„+i). Following our discussion on transport maps from Section 11.31 we assume the 
existence of a transport map X' — T„+i(X), depending on yo(ira+i), that couples the two PDFs. The 
use of optimal transport maps in the context of Bayesian inference and intermittent data assimilation 
was first proposed in [321 [33] . 

Proposition (Filtering by transport maps). Assuming the existence of appropriate transport maps T„+i, 
which couple 7r„_|_i(a;|Y„) and Tr„+i{x\Yn+i), the filtering problem is solved by the following recursion for 
the random variables Xn+i, n>0: 



X„+i = r„+i i^Xn + Ai/(X„) + V2A^Z„j , (36) 
which gives rise to a time- dependent Markov process. 

Proof. Follows trivially from pSI) . □ 
The rest of this section is devoted to several Monte Carlo methods for sequential filtering. 



3.2 Sequential Monte Carlo method 

In our framework, a standard sequential Monte Carlo method, also called bootstrap particle filter, may be 
described as an ensemble of random variables Xi and associated realizations (referred to as "particles" ) 
Xi = Xi{uj), which follow the stochastic difference equation ([TT)) . choosing the transport map in p6)l to 
be the identity map. Observational data is taken into account using importance sampling as discussed 
in Section [1.41 i.e., each particle carries a weight Wi{tn), which is updated according to Bayes' formula 

Wi{tn+l) CX Wi{tn)Tr{yo{tn+i)\Xi{tn+l)). 

The constant of proportionality is chosen such that the new weights {wi{tn+i)}f£i sum to one. 

Whenever the particle weights Wi{tn) start to become highly non- uniform (or possibly also after each 
assimilation step) resampling is necessary in order to generate a new family of random variables with 
equal weights. 

Most available resampling methods start from the weighted empirical measure 

M 

(dx) = ^ w^^lx, (dx) (37) 

i=l 
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associated with a set of weighted samples {xi, Wi}f£i. The idea is to replace each of the original samples 
by > offsprings with equal weights lii = 1/M. The distribution of offsprings is chosen to be equal 
to the distribution of M samples (with replacement) drawn at random from the empirical distribution 
([57)) . In other words, the offsprings {5i}f£i follow a multinomial distribution defined by 

P(e, = n.„^ = 1,...,M) = -^^ l[{w,r (38) 

11^=1 nil 

with Ui > such that — ^ ■ practice, independent resampling is often replaced by residual 

or systematic resampling. We next summarize residual resampling while we refer the reader to |3j for an 
algorithmic description of systematic resampling. 

Definition (Residual resampling). Residual resampling generates 

offsprings of each ensemble member Xi with weight Wi, i = 1, . . . , M . Here [x\ denotes the integer part 
of x and follows the multinomial distribution (j38p with weights Wi being replaced by 

_ ^ Mw, - lMw^\ 

and with a total of 

M 

^ rii = M := M - ^ [Mwi\ 

i=l i 

independent trials. 

In generalization of (I38p . we introduce the notation Mult(L; wi, . . . , wm) to denote the multino- 
mial distribution of L independent trials, where the outcome of each trial is distributed among M 
possible outcomes according to probabilities {uJi}f£i. The following algorithm draws random samples 
from Mult(i; wi, . . . , cjAf ). We first introduce the generalized inverse cumulative distribution function 
■femp • [Oj 1] ^ {Ij • • • : for the empirical measure ((37)) . which is defined by 

(i-l i 
i=i i=i 

We next draw L independent samples ui G [0, 1] from the uniform distribution U[0, 1] and initially set 
the number of copies ^i, i = I, . . . , M, equal to zero. For I — I, . . . , i, we now increment ^/j by one for 
indices /; € {!,..., M}, I — 1, . . . , L, defined by 

i 

h = ^cmV("') = argmin > ui. 

~ j=i 

Both independent and residual resampling can be viewed as providing a coupling between the empirical 
measure (|37p will all weights being equal to = 1 /M and the target measure (|37| with identical samples 
{xi} but non-uniform weights. Clearly residual resampling provides a coupling with a smaller transport 
cost. This can already be concluded from the trivial case of equal weights in the target measure in 
which case residual resampling reduces to the identity map with zero transport cost while independent 
resampling remains non-deterministic and produces a non-zero transport cost. The following example 
outlines the optimal transportation perspective on resampling more precisely for two discrete, univariate 
random variables. 

Example (Coupling discrete random variables). Let us consider two discrete, univariate random variables 
: 51 A", i = 1, 2, with target set 

X = {xi,X2, ■ . .,xm} e M*^. 
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We furthermore assume that 



P(Xi(w) = Xi) = 1/Af, ¥{X2[uj) ^ X,) = 



for given probabihties/weights wi > 0, i = 1, . . . ,M. 
matrix T G M^^x^^ such that = {T)^J > and 



Given a couphng T and the mean 



the covariance between Xi and X2 is defined by 



Any couphng of Xi and X2 is characterized by a 



U M 

^ = 1/M, ^ = Wi 

values 



cov(Xi,X2) = ^(a^i - X2)tij{xj - xi). 

The induced Markov transistion matrix from Xi to X2 is simply given by MT- Independent resampling 
corresponds to tij = Wi/M and leads to a zero correlation between Xi and X2. On the other hand, 
maximizing the correlation results in a linear programming problem for the unknowns {tij}. Its 
solution then also defines the solution to the associated optimal transportation problem ([S]). 

More generally, sequential Monte Carlo methods differ by the way resampling is implemented and 
also in the choice of proposal step which in our context amounts to choosing transport maps r„+i in psp 
which are different from the identity map. See also the discussion in Section [3.51 below. 



3.3 Ensemble Kalman filter (EnKF) 

We now introduce an alternative to sequential Monte Carlo methods which has become hugely popular 
in the geophysical community in recent years. The idea is to construct a simple but robust transport 
map T'^j^i which replaces T„+i in p6p . This transport map is based on the Kalman update equations for 
linear SDEs and Gaussian prior and posterior distributions. We recall the standard Kalman filter update 
equations. 

Proposition (Kalman update for Gaussian distributions). Let the prior distribution ttx be Gaussian 
with mean x^ and covariance matrix . Observations yo are assumed to follow the linear model 

Y = HX + E, 

where S ~ N(0, i?) and R is a symmetric, positive- definite matrix. Then the posterior distribution 
TTx{x\yo) is also Gaussian with mean 

= - pfH'^iHPfH'^ + R)-^{Hxf - yo) (39) 

and covariance matrix 

P°- ^ pf ~ pl H^{HPf + ny^Hpf . (40) 

Here we adopt the standard meteorological notation with superscript f (forecast) denoting prior statistics, 
and superscript a (analysis) denoting posterior statistics. 

Proof. By straightforward generalization to vector- valued observations of the case of a scalar observation 
already discussed in Section [TT2l □ 

EnKFs rely on the assumption that the predicted PDF 7r„+i(x|Y„) is approximately Gaussian. The 
ensemble {xi}f£i of model states is used to estimate the mean and the covariance matrix using the empir- 
ical estimates ([^5]) and ([5(11) , respectively. The key novel idea of EnKFs is to then interpret the posterior 
mean and covariance matrix in terms of appropriately adjusted ensemble positions. This adjustment can 
be thought of as a coupling of the underlying prior and posterior random variables of which the ensembles 
are realizations. The original EnKF ^ uses perturbed observations to achieve the desired coupling. 
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Definition (Ensemble Kalman Filter). The EnKF with perturbed observations for a linear observation 
operator h{x) = Hx is given by 

^n+l = + At/(X„) + V2AlZ„, (41) 

X„+i = Xi^^ - pI^,H^{HPI^,H^ + R)-\HXi^, - yo + S„+i), (42) 

where the random variables Z„ ^ N(0, Q), S„+i ~ N(0, i?) are the mutually independent perturbations 
to the observations, yo — Voitn+i), — E^/ [a:], and 

Next, we investigate the properties of the assimilation step (j42p . 



Proposition (EnKF consistency). The EnKF update step 1^2^ propagates the mean and covariance 
matrix of X in accordance with the Kalman filter equations for Gaussian PDFs. 

Proof. It is easy to verify that the ensemble mean satisfies 

which is consistent with the Kalman filter update for the ensemble mean. Furthermore, the deviation 
5X = X — X satisfies 

<5X„+i = SXl^, ~ pI^,H^{HP,{^,H^ + R)-\H5Xi^^ + S]„+i), 

which implies 

Pn+i = - 2Pi^,H^{HPl^,H^ + R)-'HpI^,+ 

pI+,H^{HPI^,H^ + R)-'R{HPl^,H^ + R)-'HPl^,+ 

{HPl^^H^ + R)-^HPI^^H^{HPI^,H^ + R)-'HPl^, 

= Pl+i - Pl+iH'^iHPl+.H^ + Rr'HPl^, 

for the update of the covariance matrix, which is also consistent with the Kalman update step for Gaussian 
random variables. □ 

Practical implementations of the EnKF with perturbed observations replace the exact mean and 
covariance matrix by ensemble based empirical estimates (|28p and pop . respectively. 

Alternatively, we can derive a transport map T under the assumption of Gaussian prior and posterior 
distributions as follows. Using the empirical ensemble mean x we define ensemble deviations by Sxi = 
Xi — X G and an associated ensemble deviation matrix — {Sxi, . . . , Sxi\i) £ R^^^^. Using the 
notation, the empirical covariance matrix of the prior ensemble at tn+i is then given by 



We next seek a matrix S G ra^xa^ such that 



^"+1 - M - 1 



where the rows of S sum to zero in order to preserve the zero mean property of <5X„+i — 6Xl^_^_^S. Such 
matrices do exist (see e.g. [TS]) and give rise the ensemble square root filters. More specifically, Kalman's 
update formula (001) for the posterior covariance matrix implies 

P- = j^^SXf |/ - -^iSYff [HpfH^ + R] ' 6Yf^ {SXff 

- ^ sxfss^isxff, 



M - 1 
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where we have dropped the time index subscript and introduced the ensemble perturbations SY-^ — HSy^-^ 
in observation space y. Recalhng now the definition of a matrix square root from Section [1.31 and making 
use of the Sherman-Morrison- Woodbury formula [T51 , we find that 

The complete ensemble update of an ensemble square root filter is then given by 

Xi(tnJr\) = + SXIj^^SGi, (44) 

where denotes the ith basis vector in K.*^ and 

x„+i = - pI+,H^{HPI^,H^ + R)-\Hxl^, - Vo{tn+i)) 

denotes the updated ensemble mean. 

We now discuss the update (|44p from the perspective of optimal transportation which in our context 
reduces to finding a matrix S'qt G such that the trace of 

cov(<5X,{+i,<5X„+i) = n5Xi^^Sl^{5Xi^^)^) 

is maximized. 

Proposition (Optimal update for ensemble square root filter). The trace of the covariance matrix 
cov{SXf^^-^^, 5Xn+i) is maximized for 



with transform matrix 



VM - 1 



SXn+l — 5Xl_^_^SoT 



S{SX[^,fpfSX[^,s] '^'^ S{6Xl^,f5Xl^, 



and S e R^^x*^ given by (g^. 

Proof Follows from © with A SX[_^_^S/VM - 1 and Si = P-'^. The left multiphcation in © is finally 
rewritten as a right multiplication by Sqt G M^^^*^ in terms of ensemble deviations SX^^^^. □ 

Wc finish this section by brief discussions on a couple of practical issues. It is important to recall 
that the Kalman filter can be viewed as a linear minimum variance estimator |14) . At the same time it 
has been noted [511 13D] that the updated ensemble mean Xn+i is biased in case the prior distribution is 
not Gaussian. Hence the associated mean squared error (ITU)) does not vanish as M — > cx) even though 
the variance of the estimator goes to zero. If desired the bias can be removed by replacing Xn+i in (I44p 
by ([TT|) with weights (|T^ . where yo = yo{tn+i) and x^"°'^ — x{{tn+i)- Higher-order moment corrections 
can also be implemented [521 1301 . However, the filter performance only improves for sufficiently large 
ensemble sizes. 

We mention the unscented Kalman filter [2 3) as an alternative extension of the Kalman filter to 
nonlinear dynamical systems. We also mention the rank histogram filter P], which is based on first 
constructing an approximative coupling in the observed variable y alone followed by linear regression of 
the updates in y onto the state space variable x. 

Practical implementations of EnKFs for high-dimensional problem rely on additional modifications, in 
particular inflation and localization. While localization modifies the covariance matrix pf in the Kalman 
update (j42p in order to increase its rank and to localize the spatial impact of observations in physical 
space, inflation increases the ensemble spread 5xi — Xi — x hy replacing Xi by a; -I- a(xi — x) with a > 1. 
Note that the second term on the righthand side of ([3T|) achieves a similar effect and ensemble inflation 
can be viewed as simple parametrization of (stochastic) model errors. See |il5j for more details on inflation 
and localization techniques. 
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3.4 Ensemble transform Kalman-Bucy filter 

In this section, we describe an alternative implementation of ensemble square root filters based on the 
Kalman-Bucy filter. We first describe the Kalman-Bucy formulation of the linear filtering problem for 
Gaussian PDFs. 

Proposition (Kalman-Bucy equations). The Kalman update step i39\) - (J(^ can be formulated as a dif- 
ferential equation in artificial time s G [0, 1]. The Kalman-Bucy equations are 

f = -PH^R-\Hx~yo) 
as 

and 

— = -PH^R-^HP. 
as 

The initial conditions are 5(0) = and -P(O) = P^ and the Kalman update is obtained from the final 
conditions x"" = and P" = ^(1)- 

Proof. We present the proof for = 1 (one dimensional state space) and K = 1 {& single observation). 
Under this assumption, the standard Kalman analysis step p9p - (PDl) gives rise to 

pa ^ -a ^ xfR + ypPf 

Pf + R' Pi+R ' 

for a given observation value yo. 

We now demonstrate that this update is equivalent to twice the application of a Kalman analysis step 
with R replaced by 2R. Specifically, we obtain 

pa_ ^PrnR p _ 2Pf R 



for the resulting covariance matrix P"^ with intermediate value Pm- The analyzed mean f is provided 

by 

yoPm - 2x^R -\- yoP^ 
^ ~ Pm + 2R ' ~ Pf + 2R 

We need to demonstrate that P"^ — P°- and x°- — x°- . We start with the covariance matrix and obtain 

pa ^ TniuR ^ ^PfR' ^ _P!R_ ^ pa 

^P1R. + 2R APSR + AR-^ Pf + R 
A similar calculation for x"" yields 

.a ^ ^ K'fXlf R + yowi^. ^ AxfR^+AyoPfR ^ 
2R-^^0^ AR^ + ARPf 

Hence, by induction, we can replace the standard Kalman analysis step by D > 2 iterative applications 
of a Kalman analysis with R replaced by DR. We set Pq = P^ , xq = x^ , and iteratively compute P,+i 
from 



DPjR ^ _DxjR + yoPj 



Pj+i — n , no' + 1 

We final 

assume D » 1. Then 



Pj+DR' Pj+DR 
ior j — 0, . . . , D — 1. We finally set P° = and x"- — xd- Next we introduce a step-size As ~ 1/D and 



= '''pllf^p = % - A.P,i?-^ {X, - yo) + 0{As') 

as well as 

P,.. = j^^ = P,-AsP,R-^P,+OiAs^). 
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Taking the limit As — >■ 0, we obtain the two differential equations 

— = -PB-^P, — = -PR-^ {x - yo) 
as as 

for the covariance and mean, respectively. The equation for P can be rewritten in terms of its square 
root Y {i.e. P = F^) as 

^=-lpR-^Y. (45) 
as 2 

□ 



Upon formally setting Y = SX/y^M — 1 in (|45p . the Kalman-Bucy filter equations give rise to a 
particular implementation of ensemble square root filters in terms of evolution equations in artificial time 

[0,1]. 

Definition (Ensemble transform Kalman-Bucy filter equations). The ensemble transform Kalman-Bucy 
filter equations OEKT] for the assimilation of an observation yo — yo{tn) at t„ are given by 

dx- 1 

- --PH^B-\Hx, + Hx~ 2yo(i„)) 
as 2 

in terms of the ensemble members Xi, i = 1 , . . . , M and are solved over a unit time interval in artificial 
time s G [0,1]. Here P denotes the empirical covariance matrix ([50)1 and x the empirical mean (1^51) of 
the ensemble. 

The Kalman-Bucy equations are realizations of an underlying differential equation 

^^-^PH^B-\HX + Hx^2yo{tn)) (46) 
as 2 

in the random variable X with mean 



X = Ex [a;] — I xT^xdx 



and covariance matrix 

P = Ex[{x - x){x - x)"^]. 

The associated evolution of the PDF ttx (here assumed to be absolutely continuous) is given by 
Liouville's equation 

V, • i^xv) (47) 



diTx 



ds 

with vector field 

vix) = -^PH^R-\Hx + Hx~ 2yoit„)). (48) 

Recalling the earlier discussion of the Fokker-Planck equation in Section (12. 2p . we note that (|17)) with 
vector field also has an interesting geometric structure. 

Proposition (Ensemble transform Kalman-Bucy equations as a gradient flow). The vector field is 
equivalent to 

v{x) = -PV^^ 

OTTX 

with potential 

Pi^x) - 7 / {Hx- yo{tn) fB-\Hx - yo{tn))TTxdx+ 



4 



^{Hx - yo{tn)fB-\Hx ~ yo{tn))- (49) 



Liouville's equation ( |.^7| ) can be stated as 

dnx 



■ {t^xv) = -grad^^i^(7rx), 



ds 

where we have used M — P in the definition of the gradient 
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Proof. The result can be verified by direct calculation. □ 

Nonlinear forward operators can be treated in this framework by replacing the potential (|49p by, for 
example, 

FiiTx) = 7 / (hix) - yoiU,))^ R-\h{x) - 2/o(^n))7rxdx+ 

4 Jen 

- yo{tn)fR~\Hx - 2/o(t„)). 

Efficient time-stepping methods for the ensemble transform Kalman-Bucy filter equations are discussed 
in [1] and an application to continuous data assimilation can be found in [6 . 



3.5 Guided sequential Monte Carlo methods 

EnKF techniques are limited by the fact that the empirical PDFs do not converge to the filter solution 
in the limit of ensemble sizes M — >■ oo unless the involved PDFs are Gaussian. Sequential Monte Carlo 
methods, on the other hand, lead to unbiased estimators for the mean and can be shown to converge under 
fairly general assumptions, but they do not work well in high-dimensional phase spaces since importance 
sampling is not sufficient to guarantee good performance of a particle filter for finite ensemble sizes. In 
particular, the variance in the associated mean squared error (llOp can be very large for ensemble sizes 
typically used in geophysical applications. 

The combination of modified particle positions and appropriately ajdusted particle weights appears 
therefore as a promising area for research and might achieve a better bias-variance tradeoff than either 
the EnKF or traditional sequential Monte Carlo methods. In particular, combining ensemble transform 
techniques, such as EnKF, with sequential Monte Carlo methods appears as a natural research direction. 
Indeed, in the framework of Monte Carlo methods discussed in Section [L4l the standard sequential Monte 
Carlo approach consists of importance sampling using proposal PDF ir'-^ix) — 7r„4.i(a:|Y„) and subsequent 
reweighting of particles according to (IT^ . As also already discussed in Section 11.41 the performance of 
importance sampling can be improved by applying modified proposal densities 7r^_|_]^(a;|Y„+i) with the 
aim of pushing the updated ensemble members Xi{tn+i) to regions of high and nearly equal probability in 
the targeted posterior PDF 7r„-|.i(a;|Y„+i) (compare with eq. (HH)). We call the resulting filter algorithms 
guided sequential Monte Carlo methods. 

More precisely, a guided sequential Monte Carlo method is defined by a conditional proposal PDF 
Trn+i{x'\x,yo{tn+i)) and an associated joint PDF 

7rx'x(a;',a:|Y„+i) = 7r„+i (a;'|a;, j/o(in+i)) 7r„(a;|Y„). (50) 

An ideal proposal density (in the sense of coupling) would lead to a marginal distribution TTx'{x\Yn+i), 
which is identical to the posterior PDF 7r„+i(a;|Y„+i). In guided sequential Monte Carlo methods, 
a mismatch between ttx' (a;|Y„+i) and 7f„+i (a;|Y„+i) is treated by adjusted particle weights Wi(i„+i). 
Following the general methodology of importance sampling one obtains the recursion 

. TrY{yoitn+l)\x'i}T:{x'i\x,) 
Wi{tn+l) CX — : — -—W^(tn). 

TTn+l(X^\x„yo[tn+l)) 

Here 7r(a;'|x) denotes the conditional PDF (fT5)) describing the model dynamics, (x^, Xi), i = 1, . . . , M, are 
realizations from the joint PDF (|50p with weights Wi{tn), Xi = Xi{tn), and the approximation 



Ex„+i[5] = — u v: / / f{x\x)Trx'x{x\x\Yn+i)dx'Ax 

7ry(2/o(i«+i)) Jr" ' - 



1 



7ry(j/o(in+i)) 

M 

CX V w,(t„+i).g(a;^) 



M 

^m{tn)f{x'„X^) 



i=l 
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with 



Try {yo{t,i+i)\x')Tr{x'\x) 

TT,i+i{x'\x,yo{tn+l)) 



has been used. The guided sequential Monte Carlo method is continued with Xi{tn+i) = x[ and new 
weights Wi{tn+i). 

Numerical implementations of guided sequential Monte Carlo methods have been discussed, for ex- 
ample, in [28j [21 [TTJ [35]. More specifically, a combined particle and Kalman filter is proposed in [28] 
to achieve almost equal particle weights (see also the discussion in 7 ) , while in (TT] [35] , new particle 
positions Xi{tn+i) are defined by means of implicit equations. We emphasize that both implementation 
approaches give up the requirement of unbiased estimation in hope for reduced variance at finite ensemble 
sizes and hence for an overall reduction of the associated mean squared error (jlOp . 

Another broad class of methods is based on Gaussian mixture approximations to the prior PDF 
7r„+i(a;|Y„). Provided that the forward operator h is linear, the posterior PDF 7r„+i(x|Y„_(_i) is then also 
a Gaussian mixture and several procedures have been proposed to adjust the proposals x{(t„-fi) such 
that the adjusted Xi{tn+i) approximately follow the posterior Gaussian mixture PDF. See, for example, 
[ini [IB HZ] ■ Broadly speaking, these methods can be understood as providing approximate transport 
maps T^_|_i instead of an exact transport map T„+i in (j36p . However, none of these methods avoid the 
need for particle reweighting and resampling. Recall that resampling can be implemented such that it 
corresponds to a non-deterministic optimal transference plan. 

The following section is devoted to an embedding technique for constructing accurate approximations 
to the transport map r„+i in (|36p . 

3.6 Continuous ensemble transform filter formulations 

The implementation of ([36]) requires the computation of a transport map T. Optimal transportation 
(i.e., maximising the covariance of the transference plan), leads to T = V xip s-nd the potential satisfies 
the highly nonlinear, elliptic Monge- Ampere equation 



A direct numerical implementation for high-dimensional state spaces X = seems at present out 
of reach. Instead, in this section we utilize an embedding method due to Moser j37] . replacing the 
optimal transport map by a suboptimal transport map which is defined as the time-one flow map of a 
differential equation in artificial time s G [0, 1]. At each time instant, determining the right hand side 
of the differential equation requires the solution of a linear elliptic PDE; nonlinearity is exchanged for 
linearity at the cost of suboptimality. In some cases, such as Gaussian PDFs and mixtures of Gaussian, 
the linear PDE can be solved analytically. In other cases, further approximations, such as a mean field 
approach discussed later in this section, are necessary. 

Inspired by the embedding method of Moser |37j , we first summarize a dynamical systems formulation 
[32] of Bayes' formula which generalizes the continuous EnKF formulation from Section [5^ We first note 
that a single application of Bayes' formula ^ can be replaced by an D-fold recursive application of the 
incremental likelihood tt: 



where the constant of proportionality depends only on y^, and then consider the implied iteration 




(51) 



i.e., we first write Bayes formula as 



D 



TTj+l{x) 



nj{x)TT{yn\x) 



/r dxT:j{x)^{yo\x) 
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with ttq = TTx and 'Kx{'\yn) = t^d- We may now expand the exponential function in (|5ip in the small 
parameter As ~ ^/D, in the limit D ^ oo obtaining the evolution equation 

£ = - - {h{x) - yof {h{x) -yo)TT + f^n (52) 

in the fictitious time s G [0, 1]. The scalar Lagrange multiplier /i is equal to the expectation value of the 
negative log likelihood function 

L{x; yo) = ^ {h{x) - yof {h{x) - yo) (53) 

with respect to tt and ensures that J^jv {dTr/ds)dx = 0. We also set 7r(a;, 0) = Trx{x) and obtain 7rx(a;|j/o) = 
Tr{x, 1). 

We now rewrite (j52p in the equivalent, but more compact, form 

-^ = -Tr(L-Z), where Z = Ex[L]. (54) 
OS ' 

Here Ex denotes expectation with respect to the PDF ttx = 7r(-, s). It should be noted that the continuous 
embedding defined by ((M)) is not unique. Moser |37], for example, used the linear interpolation 

7r(a;, s) = (1 - s)-kx{x) + s7rx(2;|yo), 

which results in 

— = 7rx(x|?/o) - 7rx(a;). (55) 
OS 

Yet another interpolation is given by the displacement interpolation of McCann which is based on the 
optimal transportation map and which has an attractive "fluid dynamics" interpretation [491 150) . 

Eq. ([5l| (or, alternatively, ([55]) ) defines the change (or transport) of the PDF tt in fictitious time 
s e [0, 1]. Alternatively, following Moser's work [371 we can view this change as being induced by a 
continuity (Liouville) equation 

1^ - -V, • (^5) (56) 
OS 

for an appropriate vector field g{x,s) S E^. 

At any time s G [0, 1] the vector field g{-, s) is not uniquely determined by (|5^ and unless we 
also require that it is the minimizer of the kinetic energy 

T{v) = ^ / TTV^M-'^vdx 

2 jRiV 

over all admissible vector fields v : — > {i.e. g satisfies for given tt and dir/ds), where 
M € M^^^ is a positive definite matrix. Under these assumptions, minimization of the functional 

C[v,(l)] = ^ [ '!Tv'^M~^vdx+ [ (^{^ + Vx • (7rw)ldx 

for given tt and dn/ds leads to the Euler-Lagrange equations 

TTM-ig - 7rV:,V = 0, |^ + • (7r.g) = 

in the velocity field g and the potential Hence, provided that tt > 0, the desired vector field is given 
hy g = MVx^, and we have shown the following result. 

Proposition (Transport map from gradient flow). // the potential •ip{x,s) is the solution of the elliptic 
PDE 

V, • (ttxMV,^) = 7rx(L-L), (57) 
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then the desired transport map x' — T(x) for the random variable X with PDF t:x{x, s) is defined by the 
time- one- flow map of the differential equations 



— = ~MV,V- 
as 

The continous Kalman-Bucy filter equations correspond to the special case M = P and ip = SF/Sttx with 
the functional F given by J^ffp . 

The elliptic PDE ([57)1 can be solved analytically for Gaussian approximations to the PDF ttx and the 
resulting differential equations are equivalent to the ensemble transform Kalman-Bucy equations (|46p. 
Appropriate analytic expressions can also be found in case where ttx can be approximated by a Gaussian 
mixture and the forward operator h{x) is linear (see [43] for details). 

Gaussian mixtures are contained in the class of kernel smoothers. It should however be noted that 
approximating a PDF ttx over high-dimensional phase spaces X = using kernel smoothers is a 
challenging task, especially if only a relatively small number of realizations Xi, i — 1, . . . , M, from the 
associated random variable X are available. 

In order to overcome this curse of dimensionality, we outline a modification to the above continuous 
formulation, which is inspired by the rank histogram filter of Anderson [2]. For simplicity of exposition, 
consider a single observation y G M with forward operator h : — )• R. We augment the state vector 
X G R^ by y = h{x), i.e. we consider {x,y) and introduce the associated joint PDF 

■^XY{x,y) ^ ^^x{x\y)^^Y(y)■ 
We apply the embedding technique first to y alone resulting in 



dy 
ds 



with 

dy{'KY{y)fy{y)) = '^Y{y){L - L). 

One then finds an equation in the state variable x E M.^ from 

Vo; • {TTx{x\y)fa,{x,y,s)) fy{y,s)dyTTx{x\y) = 

and 

dx 

-1- = fx(x,y,s). 
ds 

Next we introduce the mean field approximation 

7T,ix'\y)7r2{x^\y)---TTN{x^\y) (58) 

for the conditional PDF ttx {x\y) with the components of the state vector written as a; = (x^, x^, . . . , x'^)'^ € 
R''^. Under the mean field approximation the vector field fx = (/a;i, /x^, • ■ • , /x")"^ can be obtained 
component-wise by solving scalar equations 

d,{'Kkiz\y)f^k{z,y)) + fy{y) dyTTk{z\y) = 0, (59) 

k= 1,...,N, for f^k{z,y) with z — x'^ G R. The (two-dimensional) conditional PDFs Trk{x''\y) need to 
be estimated from the available ensemble members Xi € R^ by either using parametric or non-parametric 
statistics. 

We first discuss the case for which both the prior and the posterior distributions are assumed to be 
Gaussian. In this case, the resulting update equations in x € R^ become equivalent to the ensemble 
transform Kalman-Bucy filter. This can be seen by first noting that the update in a scalar observable 
y g R is 
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Furthermore, if the condition PDF TTk{z\y), z = x*^ e M, is of the form then (|59p leads to 

f^k{x'',y) = Crly(Jyyfy{y), 

which, combined with the approximation ([55]) . resuhs in the continuous ensemble transform Kalman-Bucy 
filter formulation discussed previously. 

The rank histogram filter of Anderson [2 corresponds in this continuous embedding formulation to 
choosing a general PDF Try (y) while a Gaussian approximation is used for the conditional PDFs Tr^ix'' |y). 

Other ensemble transform filters can be derived by using appropriate approximations to the marginal 
PDF TTy and the conditional PDFs 7Tk{x''\y), k = 1,...,N, from the available ensemble members Xi, 
i = 1,...,M. 

References 

An excellent introduction to filtering and Bayesian data assimilation is [32]. The linear filter theory 
(Kalman filter) can, for example, be found in [3S]. Fundamental issues of data assimilation in a me- 
teorological context are covered in [25]. Ensemble filter techniques and the ensemble Kalman filter are 
treated in depth in [TS] . Sequential Monte Carlo methods are discussed in [T31 HJ |3] and by [371 H] in 
a geophysical context. See also the recent monograph [H]. The transport view has been proposed in 
[T3] for continuous filter problems and in [321 intermittent data assimilation. Gaussian mixtures are 
a special class of non-parametric kernel smoothing techniques which are discussed, for example, in [51j . 

4 Concluding remarks 

We have summarized the Bayesian perspective on sequential data assimilation and filtering in particular. 
Special emphasize has been put on discussing Bayes' formula in the context of coupling of random 
variables, which allows for a dynamical system's interpretation of the data assimilation step. Within 
a Bayesian framework all variables are treated as random. While this implies an elegant mathematical 
treatment of data assimilation problems, any Bayesian approach should be treated with caution in the 
presence of sparse data, high-dimensional model problems, and limited sample sizes. It should be noted 
in this context that successful assimilation techniques such as 4DVar (not covered in this survey) and the 
EnKF lead to biased approximations to the state estimation problem. In both cases the bias is due to 
the fact that the algorithms are derived under the assumption that the prior distributions are Gaussian. 
Nevertheless 4DVar and EnKF work often well in terms of the observed mean squared error (|10p since 
the variance of the estimator remains small even for relatively small ensemble sizes M . On the contrary, 
asymptotically unbiased Bayesian approaches such as sequential Monte Carlo methods suffer from the 
curse of dimensionality, lead generally to large variances in the estimators for small M and have therefore 
not yet found systematic applications in operational forecasting, for example. To overcome this limitation, 
one could consider more suitable proposal steps such as guided sequential Monte Carlo methods and/or 
impose certain independence assumptions such as mean field approximations which lead to an improved 
balance between bias and variance in the mean squared error (|10l) . See also the discussion of [20] on the 
bias-variance tradeoff in the context of supervised learning. Promising results for guided particle filters 
have been reported very recently in [341129) . Alternatively, non-Bayesian approaches to data assimilation 
could be explored in the future such as: (i) shadowing for partially observed reference solutions, (ii) a 
nonlinear control approach with transport maps as dynamic feedback laws, (iii) derivation and analysis 
of ensemble filter techniques within the framework of stochastic interacting particle systems. 
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